
IGNITE-22662 : Snapshot check as distributed process #11391

Conversation

Vladsz83 (Contributor)

No description provided.

@Vladsz83 Vladsz83 changed the title Check snapshot as distributed process IGNITE-22662 : Snapshot check as distributed process Jul 4, 2024
@Vladsz83 Vladsz83 changed the base branch from master to IGNITE-22662__snapshot_refactoring July 10, 2024 10:29
IgniteSnapshotManager snpMgr = kctx.cache().context().snapshotMgr();

if (allRestoreHandlers) {
workingFut = CompletableFuture.supplyAsync(() -> {
Reviewer (Member):

We should specify the snapshot executor for the async job

Author (Vladsz83, Contributor):

I'm afraid we can't, because the same executor is used somewhere inside the task. I tried it. The executor might be configured with a single thread; that thread gets blocked waiting for the task, leaving no thread for the workers. Tests like testChangeSnapshotTransferRateInRuntime() hang.
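
For illustration, a minimal generic Java sketch of the deadlock described above (hypothetical class and variable names, not Ignite code): a single-thread pool runs an outer job that blocks waiting on an inner job submitted to the same pool.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SingleThreadDeadlockSketch {
    public static void main(String[] args) {
        // Assume the snapshot pool is configured with a single thread.
        ExecutorService snapshotExec = Executors.newFixedThreadPool(1);

        CompletableFuture<String> outer = CompletableFuture.supplyAsync(() -> {
            // The only pool thread is occupied by this outer job...
            CompletableFuture<String> inner =
                CompletableFuture.supplyAsync(() -> "checked", snapshotExec);

            // ...and blocks waiting for the inner job, which can never be scheduled.
            return inner.join();
        }, snapshotExec);

        outer.join(); // Hangs: no free thread is left for 'inner'.
    }
}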

void interrupt(Throwable err) {
contexts.forEach((snpName, ctx) -> {
if (ctx.fut != null)
ctx.fut.onDone(err);
Reviewer (Member):

It is possible that interrupt will not interrupt anything. There are still time windows when:

  1. The first phase has not started yet and the futures are null.
  2. The first phase has finished, the second phase has not started yet, and the futures are null.

I again recommend creating a future that lives on every node for the whole process, with the phase futures listening to it.
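
A rough, generic sketch of the suggested pattern (hypothetical names, not code from the PR): one future lives on the node for the whole process, and every phase future is chained to it, so interruption also covers the gaps between phases.

import java.util.concurrent.CompletableFuture;

class ProcessLifetimeFutureSketch {
    /** Created when the distributed process starts and lives until it finishes. */
    private final CompletableFuture<Void> processFut = new CompletableFuture<>();

    /** Chains a phase future to the process future. */
    <T> CompletableFuture<T> registerPhase(CompletableFuture<T> phaseFut) {
        processFut.whenComplete((ignored, err) -> {
            if (err != null)
                phaseFut.completeExceptionally(err);
        });

        return phaseFut;
    }

    /** Interrupts the whole process, regardless of which phase (if any) is currently running. */
    void interrupt(Throwable err) {
        processFut.completeExceptionally(err);
    }
}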

Author (Vladsz83, Contributor), Aug 7, 2024:

interrupt() means the node is stopping. SnapshotManager stops and its thread pool stops. Discovery should not work, so the process phases should not start and should not be able to work because SnapshotManager has stopped. No problem is expected. Also, keeping a single future would require resetting it, which brings the same race. Cancelling a finished future does nothing; it does not even store the exception or a cancelled flag. A reset() would have to revive the future and restart it.

if (ctx.fut != null)
ctx.fut.onDone(err);

it.remove();
Reviewer (Member):

Why do you remove the context here? The reduce phase is already responsible for cleanup.

Author (Vladsz83, Contributor):

Because we do not store the error any more, we cannot tell at the reduce phase that we should not work. The phase does not run if there is no context; if there is one, it runs. Also because we stop the future here.

SnapshotCheckContext ctx;

// The context can be null, if a required node leaves before this phase.
if (!req.nodes().contains(kctx.localNodeId()) || (ctx = context(null, req.requestId())) == null || ctx.locMeta == null)
Reviewer (Member):

Does ctx.locMeta == null cover the case !req.nodes().contains(kctx.localNodeId())?

Author (Vladsz83, Contributor):

No. We've added the client-initiator to the required nodes, and it has no meta. Previously, the required nodes were only data nodes. Also, a snapshot might be distributed across the baseline nodes in any manner and/or be restored from another cluster, so any node may lack the snapshot meta. The test testRestoreFromAnEmptyNode() shows a case like that.

*
* @param err The interrupt reason.
*/
void interrupt(Throwable err) {
Reviewer (Member):

This method is invoked after IgniteSnapshotManager#busyLock is acquired, but the snapshot check doesn't check this lock. It looks like none of these collections are synchronized with node stopping.
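
For reference, a generic sketch of the busy-lock guard pattern the comment refers to (plain Java, hypothetical names, not the actual IgniteSnapshotManager code): operations enter the lock before touching shared state, and node stop waits for them to leave before cleaning up.

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class BusyLockGuardSketch {
    private final ReadWriteLock busyLock = new ReentrantReadWriteLock();

    private volatile boolean stopped;

    /** Called by an operation before it touches shared state; returns false if the node is stopping. */
    boolean enterBusy() {
        busyLock.readLock().lock();

        if (stopped) {
            busyLock.readLock().unlock();

            return false;
        }

        return true;
    }

    /** Called by an operation when it is done with shared state. */
    void leaveBusy() {
        busyLock.readLock().unlock();
    }

    /** Called on node stop; waits until all in-flight operations leave before cleaning up. */
    void onStop(Throwable err) {
        busyLock.writeLock().lock();

        try {
            stopped = true;
            // Complete outstanding phase futures with 'err' and clear the collections here.
        }
        finally {
            busyLock.writeLock().unlock();
        }
    }
}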

Author (Vladsz83, Contributor), Aug 15, 2024:

This method should be renamed to 'onStop()'. Everything is stopping: SnapshotManager is stopping, the thread pools are stopping. Discovery should not accept or process messages. No other check process can start, and even if one starts, it cannot work because the thread pools are stopping. No problems are expected.


// A not required node can leave the cluster and its result can be null.
return results.entrySet().stream()
.filter(e -> requiredNodes.contains(e.getKey()) && e.getValue() != null)
Reviewer (Member):

It looks like we already check requiredNodes in an assert in reduceValidatePartsAndFinish?

Author (Vladsz83, Contributor):

No. The results can contain nulls and can come from non-required, non-data nodes. NPEs would arise.

boolean skipPartsHashes
) {
try {
return checkPartitions(meta, snpDir, groups, forCreation, checkParts, skipPartsHashes).get();
Reviewer (Member):

Same as above: let the invoker call get() with a timeout.
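
For illustration, a generic form of what is suggested (hypothetical names, plain Java; the actual future type in the PR may differ): the caller bounds the wait instead of blocking indefinitely.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedGetSketch {
    static <T> T awaitWithTimeout(CompletableFuture<T> fut, long timeoutSec) throws Exception {
        try {
            // Bounded wait: fails fast instead of blocking the caller forever.
            return fut.get(timeoutSec, TimeUnit.SECONDS);
        }
        catch (TimeoutException e) {
            throw new IllegalStateException("Snapshot partition check timed out after " + timeoutSec + "s.", e);
        }
    }
}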

Author (Vladsz83, Contributor):

We have no timeout on snapshot validation. The user cannot define one and can't even cancel it yet. There is no value to pass.

catch (IgniteCheckedException e) {
throw new IgniteException("Failed to check partitions of snapshot '" + meta.snapshotName() + "'.", e);
}
});
Reviewer (Member):

snapshot executor?

Author (Vladsz83, Contributor):

Nope. The check functions use the same executor. If it is configured with just 1 thread (setSnapshotThreadPoolSize(1)), we'll freeze here. Tests like testChangeSnapshotTransferRateInRuntime() would hang.

@timoninmaxim timoninmaxim merged commit 8ef9bcf into apache:IGNITE-22662__snapshot_refactoring Aug 22, 2024
@Vladsz83 Vladsz83 deleted the checkSnpAsDistrProc branch August 22, 2024 15:42